Elimination of splitting errors in printed Bangla scripts
Abstract
Accurate and robust character segmentation is a significant challenge in Bangla optical character recognition (OCR). The two main segmentation errors are joining errors and splitting errors. Several algorithms have been proposed in the literature to solve joining errors, with varying degrees of accuracy. Few solutions have been proposed for splitting errors; however, their accuracy was not measured. In an actual implementation of the proposed techniques, we observe the presence of over-segmented units. In this paper, we present a dissection-based splitting error elimination method that solves the problem of over-segmentation across a wide range of document images. Our method works in two stages. In the first stage, we concentrate on carefully clipping the matraa (headline), taking care to keep intact the pixel information of the units that are sensitive to splitting errors. In the second stage, we apply several rules based on the feature information of the units in a word. Together, these two stages achieve a success rate of 99.93% in eliminating splitting errors.
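As a rough illustration of the two-stage idea, the sketch below locates a headline row with a horizontal projection profile, blanks it, splits the remaining ink at empty columns, and then applies a single merging rule. This is not the paper's algorithm: the projection-profile approach is a commonly used stand-in, and the function names, the `min_fill` and `min_width` thresholds, and the toy `WORD` image are all assumptions made for the example.

```python
# Hypothetical sketch; names and thresholds are illustrative assumptions,
# not taken from the paper.

def find_matraa_rows(word, min_fill=0.8):
    """Return indices of rows whose ink fraction is at least min_fill."""
    width = len(word[0])
    return [r for r, row in enumerate(word) if sum(row) / width >= min_fill]

def clip_matraa(word, rows):
    """Return a copy of the image with the given rows blanked."""
    rows = set(rows)
    return [[0] * len(row) if r in rows else list(row)
            for r, row in enumerate(word)]

def split_columns(word):
    """Return (start, end) column spans, end exclusive, separated by empty columns."""
    width = len(word[0])
    ink = [any(row[c] for row in word) for c in range(width)]
    spans, start = [], None
    for c, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = c
        elif not has_ink and start is not None:
            spans.append((start, c))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

def merge_narrow(spans, min_width=2):
    """Assumed rule: a span narrower than min_width joins the span to its left."""
    merged = []
    for start, end in spans:
        if merged and end - start < min_width:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# Toy binary word image (1 = ink); row 0 plays the role of the matraa.
WORD = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

clipped = clip_matraa(WORD, find_matraa_rows(WORD))
units = merge_narrow(split_columns(clipped))
print(units)
```

In this toy image the one-column fragment on the right is treated as an over-segmented piece and rejoined with its left neighbour. A real system would base such rules on richer unit features than width alone.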
Similar resources
A Survey on Script Segmentation for Bangla OCR
Script segmentation is an important primary task for any Optical Character Recognition (OCR) software. It is especially important in off-line OCR for printed characters. Through script segmentation, a large image of a written document is fragmented into a number of small pieces, which are then used for pattern matching to determine the expected sequence of characters. In the impl...
Recognition of Isolated Multi-Oriented Handwritten/Printed Characters using a Novel Convex-Hull Based Alignment Technique
Handwritten character recognition is one of the difficult tasks of pattern recognition due to diverse writing styles. The problem becomes more severe if the characters are written in a cursive fashion with varying orientations. Also there may exist printed characters of different shapes/fonts and sizes in a document image. In the current work, we have presented a novel convex hull based alignme...
Machine-printed and hand-written text lines identification
There are many types of documents where machine-printed and handwritten texts appear intermixed. Since the optical character recognition (OCR) methodologies for machine-printed and handwritten texts are different, to achieve optimal performance it is necessary to separate these two types of texts before feeding them to their respective OCR systems. In this paper, we present a machine-printed a...
A survey on optical character recognition for Bangla and Devanagari scripts
The past few decades have witnessed intensive research on optical character recognition (OCR) for Roman, Chinese, and Japanese scripts. A lot of work has also been reported on OCR efforts for various Indian scripts, like Devanagari, Bangla, Oriya, Tamil, Telugu, Malayalam, Kannada, Gurmukhi, Gujarati, etc. In this paper, we present a review of OCR work on Indian scripts, mainly on ...
Handwritten Segmentation in Bangla Script: A Review of Offline Techniques
Offline handwritten segmentation in Bangla is an interesting area of research, as segmentation has long been one of the most critical areas of the optical character recognition process. Through this operation, an image of a sequence of characters, which may be connected in some cases, is decomposed into sub-images of individual alphabetic symbols. In this paper, segmentation of cursive handwritten s...
Publication date: 2008